NSF PAR Search | NSF Public Access Repository

Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models Trained on Corrupted Data

Aali, Asad; Daras, Giannis; Levac, Brett; Kumar, Sidharth; Dimakis, Alexandros G; Tamir, Jonathan I (April 2025, ICLR)

We provide a framework for solving inverse problems with diffusion models learned from linearly corrupted data. Firstly, we extend the Ambient Diffusion framework to enable training directly from measurements corrupted in the Fourier domain. Subsequently, we train diffusion models for MRI with access only to Fourier subsampled multi-coil measurements at acceleration factors R = 2, 4, 6, 8. Secondly, we propose Ambient Diffusion Posterior Sampling (A-DPS), a reconstruction algorithm that leverages generative models pre-trained on one type of corruption (e.g. image inpainting) to perform posterior sampling on measurements from a different forward process (e.g. image blurring). For MRI reconstruction in high acceleration regimes, we observe that A-DPS models trained on subsampled data are better suited to solving inverse problems than models trained on fully sampled data. We also test the efficacy of A-DPS on natural image datasets (CelebA, FFHQ, and AFHQ) and show that A-DPS can sometimes outperform models trained on clean data for several image restoration tasks in both speed and performance.
Free, publicly-accessible full text available April 24, 2026
Infilling Score: A Pretraining Data Detection Algorithm for Large Language Models

Raoof, Negin; Rout, Litu; Daras, Giannis; Sanghavi, Sujay; Caramanis, Constantine; Shakkottai, Sanjay; Dimakis, Alex (March 2025, ICLR 2025)

In pretraining data detection, the goal is to detect whether a given sentence is in the dataset used for training a Large Language Model (LLM). Recent methods (such as Min-K% and Min-K%++) reveal that most training corpora are likely contaminated with both sensitive content and evaluation benchmarks, leading to inflated test set performance. These methods sometimes fail to detect samples from the pretraining data, primarily because they depend on statistics composed of causal token likelihoods. We introduce Infilling Score, a new test-statistic based on non-causal token likelihoods. Infilling Score can be computed for autoregressive models without re-training using Bayes' rule. A naive application of Bayes' rule scales linearly with the vocabulary size. However, we propose a ratio test-statistic whose computation is invariant to vocabulary size. Empirically, our method achieves a significant accuracy gain over state-of-the-art methods, including Min-K% and Min-K%++, on the WikiMIA benchmark across seven models with different parameter sizes. Further, we achieve higher AUC compared to reference-free methods on the challenging MIMIR benchmark. Finally, we create a benchmark dataset consisting of recent data sources published after the release of Llama-3; this benchmark provides a statistical baseline to indicate potential corpora used for Llama-3 training.
Free, publicly-accessible full text available March 26, 2026
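
The Bayes' rule step mentioned in the Infilling Score abstract above can be illustrated with a minimal sketch. It is not the paper's test statistic: it only shows the naive computation of a non-causal token likelihood from a causal model, using a hypothetical three-token bigram model (`VOCAB`, `BIGRAM`, and both function names are assumptions made for illustration). The normalizer needs one look-ahead evaluation per candidate token, which is why the naive approach scales linearly with vocabulary size.

```python
# Hypothetical toy causal model: BIGRAM[prev][tok] = P(tok | prev).
VOCAB = ["a", "b", "c"]
BIGRAM = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.4, "b": 0.4, "c": 0.2},
}


def causal_prob(prev, tok):
    """Causal likelihood P(tok | prev) under the toy bigram model."""
    return BIGRAM[prev][tok]


def infill_probs(prev, nxt):
    """Non-causal likelihood P(x_t = v | x_{t-1} = prev, x_{t+1} = nxt) for all v.

    Bayes' rule: the posterior is proportional to
    P(v | prev) * P(nxt | v), normalized over the whole vocabulary.
    """
    # One look-ahead evaluation per candidate token: cost is linear in len(VOCAB).
    joint = {v: causal_prob(prev, v) * causal_prob(v, nxt) for v in VOCAB}
    normalizer = sum(joint.values())
    return {v: p / normalizer for v, p in joint.items()}


probs = infill_probs("a", "a")
for token, p in probs.items():
    print(f"P({token} | prev='a', next='a') = {p:.4f}")
```

With a real language model, each entry of `joint` would require a separate forward pass over the future tokens; the abstract states that the proposed ratio statistic avoids this dependence on vocabulary size.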
Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data

Daras, Giannis; Dimakis, Alex; Daskalakis, Constantinos (July 2024, Proceedings of the 41st International Conference on Machine Learning (ICML))

Full Text Available
Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data

Daras, Giannis; Dimakis, Alexandros G; Daskalakis, Constantinos (June 2024, Proceedings of Machine Learning Research)

Full Text Available
Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent

Daras, Giannis; Dagan, Yuval; Dimakis, Alex; Daskalakis, Constantinos (December 2023, NeurIPS 2023)

Full Text Available
Ambient Diffusion: Learning Clean Distributions from Corrupted Data

Daras, Giannis; Shah, Kulin; Dagan, Yuval; Gollakota, Aravind; Dimakis, Alex; Klivans, Adam R (December 2023, NeurIPS 2023)

Full Text Available
Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for DDIM-type samplers

Chen, Sitan; Daras, Giannis; Dimakis, Alex (July 2023, ICML 2023)

Full Text Available
Ambient Diffusion: Learning Clean Distributions from Corrupted Data

Daras, Giannis; Shah, Kulin; Dagan, Yuval; Gollakota, Aravind; Dimakis, Alexandros G; Klivans, Adam (January 2023, Advances in neural information processing systems)

Full Text Available
DataComp-LM: In search of the next generation of training sets for language models

Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir; Bansal, Hritik; Guha, Etash; Keh, Sedrick; Arora, Kushal; et al (April 2025, https://doi.org/10.48550/arXiv.2406.11794)

The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with dataset curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline, the authors find that model-based filtering is critical for assembling a high-quality training set. Their resulting dataset, DCLM-Baseline, enables training a 7B parameter model from scratch to achieve 64% 5-shot accuracy on MMLU with 2.6T training tokens. This represents a 6.6 percentage point improvement over MAP-Neo (the previous state-of-the-art in open-data LMs), while using 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%), and performs similarly on an average of 53 NLU tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.
Free, publicly-accessible full text available April 21, 2026
